Whole-Genome
sequencing technologies
Learning objectives
1. Understand the principles behind the different sequencing technologies
2. Library preparation methods and their effect on the data
3. Perform quality control (QC) on long-read sequencing data
4. Align long-reads to a reference genome
5. Call SNPs in long-read sequencing data
6. Analyze differences between sequencing technologies
7. Compare and intersect VCF files
8. Use common benchmarking metrics
Hunter, Lawrence. "Life and its molecules: A brief
introduction." AI Magazine 25.1 (2004): 9.
Why is DNA sequencing important?
History of genome sequencing
1977 “DNA Sequencing by Chemical Degradation” is published by Allan Maxam and Walter Gilbert
1977 “DNA Sequencing by Enzymatic Synthesis” is published by Fred Sanger
1980 Fred Sanger and Walter Gilbert receive
the Nobel Prize in Chemistry
1982 GenBank starts as a public repository
of DNA sequences
1986 Leroy Hood’s laboratory at the
California Institute of Technology announces the
first semi-automated DNA sequencing machine.
1997 Genome sequence of E. coli is published
2001 Draft sequence of the Human genome
is published
2004 First next generation sequencing
technologies become available
2009 Genome-wide single-cell sequencing
becomes available
Genomics technology
First-Generation Sequencing
1. Low throughput
2. Radio- or fluorescently-labelled dNTPs or oligonucleotides before electrophoretic analysis
3. Eg: Sanger sequencing
Second-Generation Sequencing
1. High throughput
2. Parallelization of a large number of reactions
3. Eg: Illumina.
Third-Generation Sequencing
1. Sequencing single molecules, so they don't require DNA amplification.
2. PacBio (Pacific Biosciences) and ONT (Oxford Nanopore).
Fourth-Generation Sequencing?
1. Spatial Genomics (eg: 10x Genomics, NanoString, Bio-Rad…)
Genomics technology
Eras: 1st era (Sanger sequencing, genome projects), 2nd era (short reads), 3rd era (early long reads), 4th era (telomere-to-telomere)
Sanger DNA sequencing: 1977-1990s
DNA microarrays: since mid-1990s
2nd-generation DNA sequencing: since ~2007
3rd-generation & single-molecule DNA sequencing: since ~2010
Fred Sanger (1918-2013): “chain termination” sequencing
Genomics technology
Sanger sequencing
1977-1990s
First practical method invented by Fred Sanger in
1977. Initially used to sequence shorter genomes,
e.g. viral genomes 10,000s of bases long.
Not-so-high-throughput Sanger sequencing
Fred Sanger in episode 3 of PBS documentary “DNA”
Sanger sequencing
Fluorescent dideoxynucleotides (ddNTPs: A, C, G, T): no 3' hydroxyl group
Normal nucleotides (dNTPs: A, C, G, T)
DNA template
Primer
Sanger sequencing
https://www.nature.com/articles/s41587-023-01986-3
Sanger sequencing throughput
1977: Sanger et al. invent the method
700 bases per day = 12,000 years to sequence the human genome
1985: ABI 370 (first automated sequencer)
5,000 bases per day = 1,600 years
1995: ABI 377 (bigger gels, better chemistry & optics, more sensitive dyes, faster computers)
19,000 bases per day = 430 years
1999: ABI 3700 (96 capillaries, 96-well plates, fluid-handling robots)
400,000 bases per day = 20.5 years
https://journals.sagepub.com/cms/10.1177/221106829800300503/asset/images/large/10.1177_221106829800300503-fig2.jpeg
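The "years to sequence the human genome" figures follow from simple division. A minimal sketch, assuming a round genome size of 3 billion bases and a constant daily throughput (both are simplifications, and the dictionary labels are only illustrative):

```python
# Assumption: ~3 billion bases, a round figure for one human genome.
HUMAN_GENOME_BASES = 3_000_000_000

def years_to_sequence(bases_per_day, genome_size=HUMAN_GENOME_BASES):
    """Years needed to read genome_size bases once at a constant daily rate."""
    return genome_size / bases_per_day / 365

# Daily throughput values as listed on the slide.
throughput = {
    "1977 manual Sanger": 700,
    "1985 ABI 370": 5_000,
    "1995 ABI 377": 19_000,
    "1999 ABI 3700": 400_000,
}

for name, rate in throughput.items():
    print(f"{name}: {years_to_sequence(rate):,.1f} years")
```

The results land close to the rounded values on the slide (roughly 12,000, 1,600, 430 and 20.5 years).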
Second Generation Sequencing
Common components
Flow cells as reaction chambers
Iterative sequencing process
Massive parallelization
Clonally amplified or single molecule templates
Differences
Template preparation
Sequencing chemistry
Flow cell configuration
Illumina Sequencing by synthesis
A T
G C
Double stranded
DNA (lego version)
Double stranded
DNA (double helix)
Illumina Sequencing by synthesis
Double-stranded DNA: template CCATAG paired with its complement GGTATC
Single stranded templates
DNA polymerase extends the complementary strand one base per step: C, TC, ATC, TATC, GTATC, GGTATC
Illumina Sequencing by synthesis
Input DNA
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
Cut into snippets
CCATAG TATATCT CGGCTCTAGGCCCT CATTTTTT
CCATAGTA TATCTCGG CTCTAGGCCCTC ATTTTTT
CCA TAGTATAT CTCGGCTCTAGGCCCTCA TTTTTT
CCATAGTAT ATCTCGGCTCTAG GCCCTCA TTTTTT
Deposit on slide (flowcell)
Example snippet on the slide: CCATAG
More details: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008 Nov 6;456(7218):53-9
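The "cut into snippets" step can be imitated in a few lines. This is only a toy sketch: it assumes cuts fall uniformly at random, which real fragmentation protocols do not guarantee, and the function name `fragment` is made up for illustration.

```python
import random

def fragment(sequence, n_cuts, rng):
    """Cut sequence at n_cuts distinct random positions; return the pieces in order."""
    cuts = sorted(rng.sample(range(1, len(sequence)), n_cuts))
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

input_dna = "CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT"
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Four copies of the same molecule, each cut at different random positions.
copies = [fragment(input_dna, 3, rng) for _ in range(4)]
for pieces in copies:
    print(" ".join(pieces))
```

Because each copy is cut at different positions, the snippets overlap between copies, which is what later allows reads to be placed or assembled.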
Illumina Sequencing by synthesis
Template
(billions of
them!)
DNA polymerase
C
A T
G
“Terminator”
Pauses replication
Illumina Sequencing by synthesis
A T
G C
Normal
nucleotides
Illumina Sequencing by synthesis
Fluorescence
Illumina Sequencing by synthesis
Remove terminators
(Washing step)
DNA polymerase
C
A T
G
Illumina Sequencing by synthesis
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Illumina Sequencing by synthesis
Cycle 1    Cycle 2    Cycle 3    Cycle 4    Cycle 5    Cycle 6
complement complement complement complement complement complement
G          A          T          A          C          C
Template: C C A T A G
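The cycle-by-cycle logic can be sketched as follows. It assumes the template bases are visited in the order shown on the slide (G, A, T, A, C, C) and that exactly one complementary, terminated nucleotide is incorporated per cycle; the function name is hypothetical.

```python
# Watson-Crick base pairing.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sequencing_cycles(template_in_read_order):
    """Yield (cycle number, template base, incorporated base) for each cycle."""
    for cycle, base in enumerate(template_in_read_order, start=1):
        # The terminator lets the polymerase add exactly one complementary base.
        yield cycle, base, COMPLEMENT[base]

template = "GATACC"  # template bases in the order they are visited
incorporated = "".join(inc for _, _, inc in sequencing_cycles(template))

for cycle, base, inc in sequencing_cycles(template):
    print(f"Cycle {cycle}: template {base}, incorporated {inc}")
```

The incorporated bases C, T, A, T, G, G are what the camera observes, one colour per cycle; the template sequence is recovered by taking the complement of each observed base.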
Illumina Sequencing by synthesis
Actual Illumina HiSeq 3000 image
http://dnatech.genomecenter.ucdavis.edu/2015/05/07/first-hiseq-3000-data-download/
Illumina Sequencing by synthesis
Billions of templates on a slide
Massively parallel: photograph captures all templates simultaneously
Terminators are “speed bumps,” keeping reactions in sync
Illumina Sequencing by synthesis
Cluster of clones
Illumina Sequencing by synthesis
The Illumina Flowcell
Ahead of schedule
Unterminated
Illumina Sequencing by synthesis
Base quality
Q = -10 · log10(p), where p is the probability that the base call is incorrect
Q = 10: 1 in 10 chance call is incorrect
Q = 20: 1 in 100
Q = 30: 1 in 1,000
Illumina Sequencing by synthesis
Estimate p, probability incorrect:
non-orange light / total light
p = 3 green / 9 total = 1/3
Q = -10 log10 1/3 = 4.77
Call: orange (C)
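The calculation above can be reproduced with a short sketch. It assumes the simplified model on the slide, where p is the share of light that does not match the brightest colour; real base callers use more elaborate models, and the function names are illustrative.

```python
import math

def phred(p_error):
    """Phred quality Q = -10 * log10(p) for an error probability p > 0."""
    return -10 * math.log10(p_error)

def call_base(intensities):
    """Pick the brightest colour and estimate p as non-matching light / total light."""
    total = sum(intensities.values())
    colour = max(intensities, key=intensities.get)
    p = (total - intensities[colour]) / total
    return colour, p, phred(p)

# Slide example: 6 orange and 3 green out of 9 total.
colour, p, q = call_base({"orange": 6, "green": 3})
print(colour, round(p, 3), round(q, 2))
```

Note that the formula is undefined for p = 0, so a cluster with no off-colour light would need a cap on Q.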
Illumina Sequencing by synthesis
FASTQ format Sequencing reads
Each read is stored as four lines:
Name:           @ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1
Sequence:       ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT
(placeholder):  +
Base qualities: ?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G
Reads 1 to 5 follow one another in the file, each with the same four lines (name, sequence, placeholder, base qualities).
FASTQ format Sequencing reads
Bases and qualities line up:
AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA
||||||||||||||||||||||||||||||||
HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH
Base quality is ASCII-encoded version of Q = -10 log10 p
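A minimal sketch of reading a FASTQ record and decoding its quality string. It assumes the common Phred+33 encoding (Q = ASCII code minus 33) and unwrapped four-line records; the read name `read1` is a made-up placeholder.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality string) from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    for i in range(0, len(lines), 4):
        name, seq, _plus, qual = lines[i:i + 4]
        yield name[1:], seq, qual  # drop the leading '@' from the name

def decode_qualities(qual, offset=33):
    """Convert an ASCII quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

# Sequence and qualities taken from the slide; 'read1' is a hypothetical name.
record = """@read1
AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA
+
HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH
"""

for name, seq, qual in parse_fastq(record):
    scores = decode_qualities(qual)
    print(name, min(scores), max(scores))
```

Under this encoding `H` decodes to Q39 and `5` to Q20, so the dip in the middle of the quality string marks the least reliable base of the read.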
Third Generation Sequencing
Common components
Single molecule sequencing
Long-reads
Do not require amplification
Sequencing results in real-time
Differences
Different detection mechanisms
Library preparation chemistry
Flow cell configuration
Long read platforms comparison
PacBio
Pros: accuracy, read length, detect base modifications
Cons: instrument cost
Oxford Nanopore
Pros: ultra long reads, cheap(er) instruments, direct RNA,
detect base modifications
Cons: slightly lower accuracy
Long read sequencing
Pacific Biosciences sequencing
DNA templates are captured by DNA polymerase at the
bottom of a tiny well
As the polymerase copies the template strand the
incorporation events are recorded as flashes of light
Pacific Biosciences HiFi sequencing
Clever molecular biology trick: add hairpins to
either end of DNA molecule allowing sequencing
to go around in a circle
Perform multiple passes to create a consensus
read with ~99.9% accuracy
PacBio sequencing
CLR (continuous long reads):
- Sequences the DNA strand once.
- Reads are long (>30,000 bp) but fairly error prone (10-15% errors)
- Single-pass, single-molecule sequencing has a low signal-to-noise ratio
CCS (circular consensus sequencing):
- Sequences the strand multiple times.
- Reads are shorter (~10,000 bp) but of high quality (0.1% errors)
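Why multiple passes help can be illustrated with a toy majority vote. This is a deliberately simplified sketch: it assumes the passes are already aligned, have equal length and contain only substitution errors, whereas real circular consensus calling has to handle insertions and deletions and uses a statistical model.

```python
from collections import Counter

def consensus(passes):
    """Majority vote per column over equal-length, already aligned passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*passes)
    )

# Five noisy passes over the same molecule, each with at most one substitution error.
passes = [
    "CCATAGTATA",
    "CCATCGTATA",
    "CCATAGTTTA",
    "GCATAGTATA",
    "CCATAGTATA",
]
print(consensus(passes))
```

Errors that occur at different positions in different passes are outvoted, which is the intuition behind the drop from 10-15% to about 0.1% errors.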
PacBio sequencing
Pacific Biosciences - PacBio
Nanopore sequencing
Miniaturized sequencing device that
connects to a standard laptop
Nanopore sequencing
Oxford Nanopore Sequencing
DNA is pushed through a nanopore (a
transmembrane pore protein) using a
motor protein
The system senses an electrical signal:
each nucleotide produces a characteristic
change in the current through the pore
Machine learning is used to decode
“squiggle” into a predicted read
sequence
Reads can be very long (>100kb),
~99% accurate
Figure from Goodwin et al.
Oxford Nanopore - ONT
Duplex sequencing: both strands of a DNA molecule pass through the
pore, and combining the two reads gives higher accuracy
Oxford Nanopore - ONT
Adaptive sequencing: depleting or enriching sequences on the fly
Oxford Nanopore - ONT
Nanopore raw current files (Fast5/Pod5 file format) -> Basecalling -> Reads (Fastq/Bam format)
First Generation Sequencing
Second Generation Sequencing
Third Generation Sequencing
Review Sequencing Technologies
Assignment 1
We will compare the characteristics and SNP calls for three
sequencing technologies:
1. Illumina
2. PacBio
3. Nanopore
The benchmarking will be done using the well-characterized
HG002 sample
Assignment 1
Raw FASTQ -> [FASTQ QC] -> Filtered FASTQ -> BAM -> VCF
Quality check
- Per base sequence quality
- Per tile sequence quality
- Per sequence quality scores
- Per base sequence content
- Per sequence GC content
- Per base N content
- Sequence Length Distribution
- Sequence Duplication Levels
- Overrepresented sequences
- Adapter Content
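As an illustration of what a "per base sequence quality" check summarises, the sketch below averages Phred scores per read position. It assumes Phred+33 quality strings and is not how any particular QC tool is implemented; reads may differ in length, as they do for long-read data.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position across reads of possibly different lengths."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy quality strings: 'I' decodes to Q40 and '5' to Q20.
print(per_base_mean_quality(["II5", "I5"]))
```

A falling mean towards the end of the reads is the pattern this check is typically used to spot before trimming or filtering.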
Assignment 1
Raw FASTQ -> [FASTQ QC] -> Filtered FASTQ -> [Read mapping] -> BAM -> VCF
Mapping and variant calling
BWA
SAMtools
BCFtools
The SAM/BAM format
Mapping – Read Groups
Read groups: provide technical information about the flowcell and
multiplexing of Illumina reads
@D00360:18:H8VC6ADXX:1:2113:12103:41717
Instrument name: D00360, Run ID: 18, Flowcell ID: H8VC6ADXX
PU = {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
ID = {FLOWCELL_BARCODE}.{LANE}
bwa mem -R "@RG\tID:H8VC6ADXX.1\tPU:H8VC6ADXX.1.sample1\tPL:ILLUMINA\tSM:Sample" reference fwd rev
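The ID and PU templates above can be filled in from a read name. A minimal sketch, assuming the colon-separated Illumina read-name layout (instrument, run, flowcell, lane, ...) and using `ILLUMINA` as the platform value; the function name and its arguments are hypothetical.

```python
def read_group_from_name(read_name, sample_barcode, sample_name, platform="ILLUMINA"):
    """Build an @RG string for `bwa mem -R` from an Illumina-style read name."""
    fields = read_name.lstrip("@").split(":")
    instrument, run_id, flowcell, lane = fields[:4]
    rg_id = f"{flowcell}.{lane}"        # ID = {FLOWCELL_BARCODE}.{LANE}
    pu = f"{rg_id}.{sample_barcode}"    # PU = {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
    tags = ["@RG", f"ID:{rg_id}", f"PU:{pu}", f"PL:{platform}", f"SM:{sample_name}"]
    # bwa expects the two characters backslash + t as separators, not real tabs.
    return "\\t".join(tags)

rg = read_group_from_name("@D00360:18:H8VC6ADXX:1:2113:12103:41717", "sample1", "Sample")
print(rg)
```

The printed string can be passed in quotes after `-R`; every tag needs its two-letter key (`ID:`, `PU:`, `PL:`, `SM:`).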
Marking duplicates
Software:
- Picard
- sambamba
Assignment 1
Raw FASTQ -> [FASTQ QC] -> Filtered FASTQ -> [Read mapping] -> BAM -> [Variant calling] -> VCF
FASTQ QC: Trimmomatic, FiltLong
Read mapping: BWA, minimap2
Variant calling: Bcftools, GATK, FreeBayes, Medaka, Clair3, Longshot
Assignment 1
Final output
Sample VCF
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT
chr1    1    .   T    G    60    PASS    .     GT
chr1    2    .   T    .    60    PASS    .     GT
chr1    3    .   G    C    60    PASS    .     GT
chr1    4    .   A    G    60    FAIL    .     GT
chr1    5    .   C    A    60    PASS    .     GT
chr1    6    .   T    C    60    PASS    .     GT
chr1    7    .   T    .    60    PASS    .     GT
Benchmark VCF
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT
chr1    1    .   T    G    60    PASS    .     GT
chr1    2    .   T    A    60    PASS    .     GT
chr1    3    .   G    C    60    PASS    .     GT
chr1    4    .   A    G    60    PASS    .     GT
chr1    5    .   C    T    60    PASS    .     GT
chr1    6    .   T    .    60    PASS    .     GT
chr1    7    .   T    .    60    PASS    .     GT
Each site is classified as: True Positive, True Negative, False Negative, False Positive
Assignment 1
Benchmarking
- F1 score
Assignment 1
True Positive Rate (TPR)
Specificity = 1 - FPR
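The metrics can be worked through on the seven toy sites of the Sample VCF and Benchmark VCF example. The sketch uses the standard definitions precision = TP / (TP + FP), recall = TPR = TP / (TP + FN), specificity = TN / (TN + FP) and F1 = 2 · precision · recall / (precision + recall). How a site with two different ALT alleles (position 5) is counted is an assumption made here (one FP plus one FN), and non-PASS calls are treated as not called; dedicated benchmarking tools may count such cases differently.

```python
# Toy calls copied from the example tables: POS -> (REF, ALT, FILTER).
sample = {1: ("T", "G", "PASS"), 2: ("T", ".", "PASS"), 3: ("G", "C", "PASS"),
          4: ("A", "G", "FAIL"), 5: ("C", "A", "PASS"), 6: ("T", "C", "PASS"),
          7: ("T", ".", "PASS")}
benchmark = {1: ("T", "G", "PASS"), 2: ("T", "A", "PASS"), 3: ("G", "C", "PASS"),
             4: ("A", "G", "PASS"), 5: ("C", "T", "PASS"), 6: ("T", ".", "PASS"),
             7: ("T", ".", "PASS")}

def called_alt(record):
    """Return the ALT allele of a PASS variant call, or None for reference/filtered sites."""
    ref, alt, flt = record
    return alt if alt != "." and flt == "PASS" else None

def classify(sample, benchmark):
    """Count TP, FP, FN and TN by comparing called ALT alleles site by site."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for pos in sorted(set(sample) | set(benchmark)):
        s = called_alt(sample[pos]) if pos in sample else None
        b = called_alt(benchmark[pos]) if pos in benchmark else None
        if s is None and b is None:
            counts["TN"] += 1
        elif s == b:
            counts["TP"] += 1
        else:
            if s is not None:
                counts["FP"] += 1  # sample reports a variant the benchmark lacks
            if b is not None:
                counts["FN"] += 1  # benchmark variant missed by the sample
    return counts

counts = classify(sample, benchmark)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])        # true positive rate
specificity = counts["TN"] / (counts["TN"] + counts["FP"])   # 1 - FPR
f1 = 2 * precision * recall / (precision + recall)
print(counts, round(precision, 2), round(recall, 2), round(specificity, 2), round(f1, 2))
```

In whole-genome benchmarking the number of true negatives is dominated by the billions of unchanged sites, which is one reason precision, recall and F1 are usually reported rather than specificity.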
Assignment 1
Raw FASTQ (Illumina, ONT, PacBio) -> [BWA mem, minimap2] -> BAM -> [Bcftools, GATK, FreeBayes, DeepVariant] -> VCF -> VCF benchmarking (VCF/BED)
Assignment 1
VCF
##fileformat=VCFv4.3
##contig=<ID=chr1,length=249250621>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS  ID     REF  ALT  QUAL  FILTER  INFO  FORMAT  sample
chr1    100  rs123  A    G    30    PASS    .     GT      0/1
chr1    200  rs456  C    T    40    PASS    .     GT      1/1
chr1    300  .      G    A    20    PASS    .     GT      1/1
chr2    150  rs789  T    C    50    PASS    .     GT      1/1
BED
chr1  99   100  rs123
chr1  199  200  rs456
chr1  299  300  .
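The VCF and BED rows differ by one because VCF positions are 1-based while BED intervals are 0-based and half-open. A minimal sketch of the conversion, assuming tab-separated VCF data lines and using the REF length for the interval end (which gives a one-base interval for SNPs); the function name is illustrative.

```python
def vcf_record_to_bed(line):
    """Convert one tab-separated VCF data line to a BED line (chrom, start, end, name)."""
    chrom, pos, var_id, ref = line.split("\t")[:4]
    start = int(pos) - 1      # BED starts are 0-based
    end = start + len(ref)    # BED ends are exclusive
    return f"{chrom}\t{start}\t{end}\t{var_id}"

vcf_lines = [
    "chr1\t100\trs123\tA\tG\t30\tPASS\t.\tGT\t0/1",
    "chr1\t200\trs456\tC\tT\t40\tPASS\t.\tGT\t1/1",
    "chr1\t300\t.\tG\tA\t20\tPASS\t.\tGT\t1/1",
]
for line in vcf_lines:
    print(vcf_record_to_bed(line))
```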
Assignment 1
False Positives
Position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Benchmark   T  C  C  A  G  C  C  C  T  C  A  G  C  G  T  C  A  T  G  C
Sample      T  C  C  T  G  C  A  C  G  C  A  G  C  G  T  C  A  T  C  C
Assignment 1
1. Get list of FP into bed file
Chr 3 4
Chr 6 7
Chr 8 9
Chr 18 19
2. Make genomic windows for your chromosome
Chromosome length: Chr 20
Windows:
Chr 0 10
Chr 10 20
3. Count number of FP in windows
Chr 0 10: FP = 3
Chr 10 20: FP = 1
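The three steps can be sketched in plain Python on the 20-base toy example. In practice this is usually done with an interval toolkit on real BED files; the helper names below are made up, and every mismatch between sample and benchmark is treated as a false positive, as in the example.

```python
benchmark = "TCCAGCCCTCAGCGTCATGC"
sample    = "TCCTGCACGCAGCGTCATCC"

def mismatch_bed(chrom, benchmark, sample):
    """Step 1: 0-based, half-open BED intervals where sample and benchmark differ."""
    return [(chrom, i, i + 1)
            for i, (b, s) in enumerate(zip(benchmark, sample)) if b != s]

def make_windows(chrom, length, size):
    """Step 2: consecutive windows of `size` bases covering the chromosome."""
    return [(chrom, start, min(start + size, length))
            for start in range(0, length, size)]

def count_in_windows(windows, intervals):
    """Step 3: number of intervals whose start falls inside each window."""
    return [(chrom, start, end,
             sum(1 for c, s, _ in intervals if c == chrom and start <= s < end))
            for chrom, start, end in windows]

fp_bed = mismatch_bed("Chr", benchmark, sample)
windows = make_windows("Chr", len(benchmark), 10)
counts = count_in_windows(windows, fp_bed)
for row in counts:
    print(*row)
```

The output reproduces the intervals (3-4, 6-7, 8-9, 18-19) and the per-window counts of 3 and 1 from the example.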
Assignment 1
Plot: number of FP (y-axis) against position (x-axis), one value per window: FP = 3 for the first window, FP = 1 for the second